ABSTRACT

In large systems, it is important for agents to learn to act ef- 
fectively, but sophisticated multi-agent learning algorithms 
generally do not scale. An alternative approach is to find re- 
stricted classes of games where simple, efficient algorithms 
converge. It is shown that stage learning efficiently con- 
verges to Nash equilibria in large anonymous games if best- 
reply dynamics converge. Two features are identified that 
improve convergence. First, rather than making learning 
more difficult, more agents are actually beneficial in many 
settings. Second, providing agents with statistical informa- 
tion about the behavior of others can significantly reduce 
the number of observations needed. 

Categories and Subject Descriptors 

I.2.11 [Artificial Intelligence]: Distributed Artificial
Intelligence - Multiagent systems; J.4 [Social and Behavioral
Sciences]: Economics

General Terms 

Algorithms, Economics, Theory 

Keywords 

Multiagent Learning, Game Theory, Large Games, Anony- 
mous Games, Best-Reply Dynamics 

1. INTRODUCTION 

Designers of distributed systems are frequently unable to 
determine how an agent in the system should behave, be- 
cause optimal behavior depends on the user's preferences 
and the actions of others. A natural approach is to have 
agents use a learning algorithm. Many multiagent learning 
algorithms have been proposed, including simple strategy update
procedures such as fictitious play [10], multiagent versions
of Q-learning [25], and no-regret algorithms [5].

However, as we discuss in Section 2, existing algorithms
are generally unsuitable for large distributed systems. In a 
distributed system, each agent has a limited view of the ac- 
tions of other agents. Algorithms that require knowing, for
example, the strategy chosen by every agent cannot be im-
plemented. Furthermore, the size of distributed systems re- 
quires fast convergence. Users may use the system for short 
periods of time and conditions in the system change over 
time, so a practical algorithm for a system with thousands 
or millions of users needs to have a convergence rate that is 
sublinear in the number of agents. Existing algorithms tend 
to provide performance guarantees that are polynomial or 
even exponential. Finally, the large number of agents in the 
system guarantees that there will be noise. Agents will make 
mistakes and will behave unexpectedly. Even if no agent
changes his strategy, there can still be noise in agent payoffs. 
For example, a gossip protocol will match different agents 
from round to round; congestion in the underlying network 
may affect message delays between agents. A learning
algorithm needs to be robust to this noise.

While finding an algorithm that satisfies these require- 
ments for arbitrary games may be difficult, distributed sys- 
tems have characteristics that make the problem easier. First, 
they involve a large number of agents. Having more agents 
may seem to make learning harder — after all, there are more 
possible interactions. However, it has the advantage that the 
outcome of an action typically depends only weakly on what 
other agents do. This makes outcomes robust to noise. Having
a large number of agents also makes it less useful for an
agent to try to influence others; it becomes a better policy to
try to learn an optimal response. In contrast, with a small 
number of agents, an agent can attempt to guide learning 
agents into an outcome that is beneficial for him. 

Second, distributed systems are often anonymous [1]; it
does not matter who does something, but rather how many 
agents do it. For example, when there is congestion on a link, 
the experience of a single agent does not depend on who is 
sending the packets, but on how many are being sent. 

Finally, and perhaps most importantly, in a distributed 
system the system designer controls the game agents are 
playing. This gives us a somewhat different perspective than 
most work, which takes the game as given. We do not need 
to solve the hard problem of finding an efficient algorithm 
for all games. Instead, we can find algorithms that work 
efficiently for interesting classes of games, where for us "in- 
teresting" means "the type of games a system designer might 
wish agents to play." Such games should be "well behaved," 
since it would be strange to design a system where an agent's 
decisions can influence other agents in pathological ways. 

In Section 3, we show that stage learning [9] is robust,
implementable with minimal information, and converges
efficiently for an interesting class of games. In this algorithm,
agents divide the rounds of the game into a series of stages.
In each stage, the agent uses a fixed strategy except that 
he occasionally explores. At the end of a stage, the agent 
chooses as his strategy for the next stage whatever strategy 
had the highest average reward in the current stage. We 
prove that, under appropriate conditions, a large system of 
stage learners will follow (approximate) best-reply dynamics 
despite errors and exploration. 

For games where best-reply dynamics converge, our theo- 
rem guarantees that learners will play an approximate Nash 
equilibrium. In contrast to previous results where the con- 
vergence guarantee scales poorly with the number of agents, 
our theorem guarantees convergence in a finite amount of 
time with an infinite number of agents. While the assump- 
tion that best-reply dynamics converge is a strong one, many 
interesting games converge under best-reply dynamics,
including dominance solvable games and games with monotone
best replies. Marden et al. [17] have observed that
convergence of best-reply dynamics is often a property of 
games that humans design. Moreover, convergence of best- 
reply dynamics is a weaker assumption than a common as- 
sumption made in the mechanism design literature, that the 
games of interest have dominant strategies (each agent has 
a strategy that is optimal no matter what other agents do). 

Simulation results, presented in Section 4, show that
convergence is fast in practice: a system with thousands of
agents can converge in a few thousand rounds. Further- 
more, we identify two factors that determine the rate and 
quality of convergence. One is the number of agents: having 
more agents makes the noise in the system more consistent
so agents can learn using fewer observations. The other 
is giving agents statistical information about the behavior 
of other agents; this can speed convergence by an order of 
magnitude. Indeed, even noisy statistical information about 
agent behavior, which should be relatively easy to obtain 
and disseminate, can significantly improve performance. 

2. RELATED WORK 

One approach to learning to play games is to generalize 
reinforcement learning algorithms such as Q-learning [25].
One nice feature of this approach is that it can handle games 
with state, which is important in distributed systems. In 
Q-learning, an agent associates a value with each state-action
pair. When he chooses action $a$ in state $s$, he updates
the value $Q(s,a)$ based on the reward he received
and the best value he can achieve in the resulting state $s'$
($\max_{a'} Q(s',a')$). When generalizing to multiple agents, $s$
and a become vectors of the state and action of every agent 
and the max is replaced by a prediction of the behavior 
of other agents. Different algorithms use different predic- 
tions; for example, Nash-Q uses a Nash equilibrium calcula- 
tion [15]. See [22] for a survey. 
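
As a point of reference, the single-agent update described above can be sketched as follows. This is a minimal illustration that assumes a tabular representation, a learning rate `alpha`, and a discount factor `gamma`; none of these names or values come from the text. The multiagent variants replace the `max` with a prediction of the other agents' behavior.

```python
from collections import defaultdict


def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (alpha and gamma are assumed values)."""
    # Best value achievable in the resulting state s'.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Move Q(s, a) toward the reward plus the discounted best next value.
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical two-action example with all values initialized to 0.
Q = defaultdict(float)
first = q_update(Q, "s0", "x", 1.0, "s1", ["x", "y"])
```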

Unfortunately, these algorithms converge too slowly for 
a large distributed system. The algorithm needs to expe- 
rience each possible action profile many times to guarantee 
convergence. So, with $n$ agents and $k$ strategies, the naive
convergence time is $O(k^n)$. Even with a better representation
for anonymous games, the convergence time is still
$O(n^k)$ (typically $k \ll n$). There is also a more fundamental
problem with this approach: it assumes information that
an agent is unlikely to have. In order to know which value 
to update, the agent must learn the action chosen by every
other agent. In practice, an agent will learn something
about the actions of the agents with whom he directly
interacts, but is unlikely to gain much information about the
actions of other agents. 
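
To give a feel for the scale involved, the sketch below counts the joint action profiles that a naive learner would have to visit, alongside the number of distinct anonymous outcomes (how many agents chose each action). The function names are illustrative, and the stars-and-bars count is a standard combinatorial fact rather than a figure taken from the text.

```python
from math import comb


def num_action_profiles(n_agents, n_actions):
    """Joint profiles when identities matter: k^n."""
    return n_actions ** n_agents


def num_anonymous_outcomes(n_agents, n_actions):
    """Ways to split n agents among k actions (stars and bars)."""
    return comb(n_agents + n_actions - 1, n_actions - 1)


# Even a modest system shows the gap between the two representations.
profiles = num_action_profiles(10, 3)
outcomes = num_anonymous_outcomes(10, 3)
```

The anonymous count grows polynomially in the number of agents for a fixed number of actions, which is consistent with the polynomial bound quoted above, but it is still far from sublinear.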

Another approach is no-regret learning, where agents choose 
a strategy for each round that guarantees that the regret of 
their choices will be low. Hart and Mas-Colell [13] present
such a learning procedure that converges to a correlated
equilibrium [21] given knowledge of what the payoffs of every
action would have been in each round. They also provide 
a variant of their algorithm that requires only information 
about the agent's actual payoffs [14]. However, to guarantee
convergence to within $\epsilon$ of a correlated equilibrium requires
$O(kn/\epsilon^2 \log kn)$, still too slow for large systems. Furthermore,
the convergence guarantee is that the distribution of
play converges to equilibrium; the strategies of individual 
learners will not converge. Better results can be achieved 
in restricted settings. For example, Blum et al. [2] showed
that in routing games a continuum of no-regret learners will 
approximate Nash equilibrium in a finite amount of time. 

Foster and Young [7] use a stage-learning procedure that 
converges to Nash equilibrium for two-player games. Germano
and Lugosi [11] showed that it converges for generic n-player
games (games where best replies are unique). Young [26]
uses a similar algorithm without explicit stages that also 
converges for generic n-player games. Rather than selecting 
best replies, in these algorithms agents choose new actions 
randomly when not in equilibrium. Unfortunately, these 
algorithms involve searching the whole strategy space, so 
their convergence time is exponential. Another algorithm 
that uses stages to provide a stable learning environment is 
the ESRL algorithm for coordinated exploration [24].

Marden et al. [18, 19] use an algorithm with experimentation
and best replies but without explicit stages that con-
verges for weakly acyclic games, where best-reply dynamics 
converge when agents move one at a time, rather than mov- 
ing all at once, as we assume here. Convergence is based on 
the existence of a sequence of exploration moves that lead 
to equilibrium. With $n$ agents who explore with probability
$\epsilon$, this analysis gives a convergence time of $O(1/\epsilon^n)$.
Furthermore, the guarantee requires $\epsilon$ to be sufficiently small
that agents essentially explore one at a time, so $\epsilon$ needs to
be $O(1/n)$.

There is a long history of work examining simple learning
procedures such as fictitious play [10], where each agent
makes a best response assuming that each other player's 
strategy is characterized by the empirical frequency of his 
observed moves. In contrast to algorithms with convergence 
guarantees for general games, these algorithms fail to con- 
verge in many games. But for classes of games where they 
do converge, they tend to do so rapidly. However, most work 
in this area assumes that the actions of agents are observed 
by all agents, agents know the payoff matrix, and payoffs are 
deterministic. A recent approach in this tradition is based 
on the Win or Learn Fast principle, which has limited con- 
vergence guarantees but often performs well in practice [4]. 

There is also a body of empirical work on the convergence 
of learning algorithms in multiagent settings. Q-learning has 
had empirical success in pricing games [53], n-player cooperative
games [6], and grid world games [3]. Greenwald et
al. [12] showed that a number of algorithms, including stage
learning, converge in a variety of simple games. Marden et 
al. [19] found that their algorithm converged much faster in a
congestion game than the theoretical analysis would suggest.
Our theorem suggests an explanation for these empirical ob-
servations: best-reply dynamics converge in all these games. 
While our theorem applies directly only to stage learning, it 
provides intuition as to why algorithms that learn "quickly 
enough" and change their behavior "slowly enough" rapidly 
converge to Nash equilibrium in practice. 

3. THEORETICAL RESULTS 
3.1 Large Anonymous Games 

We are interested in anonymous games with countably 
many agents. Assuming that there are countably many 
agents simplifies the proofs; it is straightforward to extend 
our results to games with a large finite number of agents. 
Our model is adapted from that of [1]. Formally, a large
anonymous game is characterized by a tuple $\Gamma = (N, A, P, \Pr)$:

• N is the countably infinite set of agents. 

• $A$ is a finite set of actions from which each agent can
choose (for simplicity, we assume that each agent can 
choose from the same set of actions). 

• $\Delta(A)$, the set of probability distributions over $A$, has
two useful interpretations. The first is as the set of
mixed actions. For $a \in A$ we will abuse notation and
denote the mixed action that is $a$ with probability 1 as
$a$. In each round each agent chooses one of these mixed
actions. The second interpretation of $\rho \in \Delta(A)$ is as
the fraction of agents choosing each action $a \in A$. This
is important for our notion of anonymity, which says
an agent's utility should depend only on how many
agents choose each action rather than who chooses it.

• $G = \{g : N \to \Delta(A)\}$ is the set of (mixed) action
profiles (i.e., which action each agent chooses). Given
the mixed action of every agent, we want to know the
fraction of agents that end up choosing action $a$. For
$g \in G$, let $g(i)(a)$ denote the probability with which
agent $i$ plays $a$ according to $g(i) \in \Delta(A)$. We can then
express the fraction of agents in $g$ that choose action
$a$ as $\lim_{n \to \infty} (1/n) \sum_{i=0}^{n} g(i)(a)$, if this limit exists. If
the limit exists for all actions $a \in A$, let $\rho_g \in \Delta(A)$
give the value of the limit for each $a$. The profiles $g$
that we use are all determined by a simple random
process. For such profiles $g$, the strong law of large
numbers (SLLN) guarantees that with probability 1 $\rho_g$
is well defined. Thus it will typically be well defined
(using similar limits) for us to talk about the fraction
of agents who do something.

• $P \subset \mathbb{R}$ is a finite set of payoffs agents can receive.

• $\Pr : A \times \Delta(A) \to \Delta(P)$ denotes the distribution over
payoffs that results when the agent performs action
$a$ and other agents follow action distribution $\rho$. We use a
probability distribution over payoffs rather than a payoff
to model the fact that agent payoffs may change
even if no agent changes his strategy. The expected
utility of an agent who performs mixed action $s$ when
other agents follow action distribution $\rho$ is
$u(s,\rho) = \sum_{a \in A} \sum_{p \in P} p \, s(a) \Pr(a,\rho)(p)$.
Our definition of $\Pr$ in terms of $\Delta(A)$ rather than $G$
ensures that the game is anonymous. We further require
that $\Pr$ (and thus $u$) be Lipschitz continuous.^1 For
definiteness, we use the L1 norm as our notion of distance
when specifying continuity (the L1 distance between two
vectors is the sum of the absolute values of the differences
in each component). Note that this formulation assumes all
agents share a common utility function.

An example of a large anonymous game is one where, in 
each round, each agent plays a two-player game against an 
opponent chosen at random. Then $A$ is the set of actions
of the two-player game and $P$ is the set of payoffs of the
game. Once every agent chooses an action, the distribution
over actions is characterized by some $\rho \in \Delta(A)$. Let $p_{a,a'}$
denote the payoff for the agent if he plays $a$ and the other
agent plays $a'$. Then the utility of mixed action $s$ given
distribution $\rho$ is

$u(s,\rho) = \sum_{(a,a') \in A^2} s(a) \, \rho(a') \, p_{a,a'}.$

3.2 Best-Reply Dynamics

Given a game $\Gamma$ and an action distribution $\rho$, a natural
goal for an agent is to play the action that maximizes his
expected utility with respect to $\rho$: $\arg\max_{a \in A} u(a,\rho)$. We
call such an action a best reply to $\rho$. In a practical amount
of time, an agent may have difficulty determining which of
two actions with close expected utilities is better, so we will
allow agents to choose actions that are close to best replies.
If $a$ is a best reply to $\rho$, then $a'$ is an $\eta$-best reply to $\rho$ if
$u(a',\rho) + \eta \ge u(a,\rho)$. There may be more than one $\eta$-best
reply; we denote the set of $\eta$-best replies $ABR_\eta(\rho)$.

We do not have a single agent looking for a best reply;
every agent is trying to find one at the same time. If
agents start off with some action distribution $\rho_0$, after they
all find a best reply there will be a new action distribution $\rho_1$.
We assume that $\rho_0(a) = 1/|A|$ (agents choose their initial
strategy uniformly at random), but our results apply to any
distribution used to determine the initial strategy. We say
that a sequence $(\rho_0, \rho_1, \ldots)$ is an $\eta$-best-reply sequence if the
support of $\rho_{i+1}$ is a subset of $ABR_\eta(\rho_i)$; that is, $\rho_{i+1}$ gives
positive probability only to approximate best replies to $\rho_i$.
An $\eta$-best-reply sequence converges if there exists some $t$ such
that for all $t' > t$, $\rho_{t'} = \rho_t$. Note that this is a particularly
strong notion of convergence because we require the $\rho_t$ to
converge in finite time and not merely in the limit. A game
may have infinitely many best-reply sequences, so we say
that approximate best-reply dynamics converge if there exists
some $\eta > 0$ such that every $\eta$-best-reply sequence converges.
The limit distribution $\rho_t$ determines a mixed strategy that
is an $\eta$-Nash equilibrium.
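
The definitions above can be made concrete with a small sketch for the random-matching case, where the utility of a pure action is its payoff averaged over the opponent's action distribution. The payoff matrix (a prisoner's dilemma with assumed values), the function names, and the tie-breaking rule are all illustrative choices, not part of the text.

```python
def utility(a, rho, payoff):
    """u(a, rho) under random matching: sum over a' of rho(a') * payoff[a][a']."""
    return sum(rho[b] * payoff[a][b] for b in range(len(rho)))


def eta_best_replies(rho, payoff, eta=0.0):
    """Actions a' with u(a', rho) + eta >= the best achievable utility."""
    values = [utility(a, rho, payoff) for a in range(len(payoff))]
    best = max(values)
    return [a for a, v in enumerate(values) if v + eta >= best]


def best_reply_sequence(rho0, payoff, max_steps=100):
    """Iterate exact best replies until the distribution stops changing."""
    seq = [rho0]
    for _ in range(max_steps):
        a = eta_best_replies(seq[-1], payoff)[0]  # break ties by lowest index
        nxt = [1.0 if b == a else 0.0 for b in range(len(rho0))]
        seq.append(nxt)
        if nxt == seq[-2]:
            break
    return seq


# Assumed prisoner's dilemma payoffs: action 0 = cooperate, 1 = defect.
pd = [[3, 0], [5, 1]]
seq = best_reply_sequence([0.5, 0.5], pd)
```

In this example defection is dominant, so the sequence reaches the all-defect distribution after a single best reply and stays there.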

Our theorem shows that learners can successfully learn in 
large anonymous games where approximate best-reply dy- 
namics converge. The number of stages needed to converge 
is determined by the number of best replies needed before 
the sequence converges. It is possible to design games that
have long best-reply sequences, but in practice most games
have short sequences. One condition that guarantees this is
if $\rho_0$ and all the degenerate action distributions $a \in A$ (i.e.,
distributions that assign probability 1 to some $a \in A$) have
unique best replies. In this case, there can be at most $|A|$
best replies before equilibrium is reached. Furthermore, in
such games the distinction between $\eta$-best replies and best
replies is irrelevant; for sufficiently small $\eta$, an $\eta$-best reply is
a best reply. It is not hard to show that the property that
degenerate strategies have unique best replies is generic; it
holds for almost every game.

^1 Lipschitz continuity imposes the additional constraint that
there is some constant $K$ such that
$|\Pr(a,\rho) - \Pr(a,\rho')| / \|\rho - \rho'\|_1 \le K$ for all $\rho$ and $\rho'$.
Intuitively, this ensures that the
distribution of outcomes doesn't change "too fast." This is a
standard assumption that is easily seen to hold in the games
that have typically been considered in the literature.

3.3 Stage Learners 

An agent who wants to find a best reply may not know the 
set of payoffs P, the mapping from actions to distributions 
over payoffs $\Pr$, or the action distribution $\rho$ (and, indeed, $\rho$
may be changing over time), so he will have to use some type
of learning algorithm to learn it. Our approach is to divide 
the play of the game into a sequence of stages. In each stage, 
the agent almost always plays some fixed action a, but also 
explores other actions. At the end of the stage, he chooses a 
new a' for the next stage based on what he has learned. An 
important feature of this approach is that agents maintain 
their actions for the entire stage, so each stage provides a 
stable environment in which agents can learn. To simplify 
our results, we specify a way of exploring and learning within 
a stage (originally described in [9]), but our results should
generalize to any "reasonable" learning algorithm used to
learn within a stage. (We discuss what is "reasonable" in
Section 5.) In this section, we show that, given a suitable
parameter, at each stage most agents will have learned
a best reply to the environment of that stage.

Given a game $\Gamma$, in each round $t$ agent $i$ needs to select
a mixed action $s_{i,t}$. Our agents use strategies that we
denote $a_\epsilon$, for $a \in A$, where $a_\epsilon(a) = 1 - \epsilon$ and
$a_\epsilon(a' \ne a) = \epsilon/(|A| - 1)$. Thus, with $a_\epsilon$, an agent almost always
plays $a$, but with probability $\epsilon$ explores other strategies
uniformly at random. Thus far we have not specified what
information an agent can use to choose $s_{i,t}$. Different games
may provide different information. All that we require is
that an agent know all of his previous actions and his
previous payoffs. More precisely, for all $t' < t$, he knows his
action $a_{t'}(i)$ (which is determined by $s_{i,t'}$) and his payoff
$p_{t'}(i)$ (which is determined by $\Pr(a_{t'}(i), \rho_{t'})$, where $\rho_{t'}$ is the
action distribution for round $t'$; note that we do not
assume that the agent knows $\rho_{t'}$). Using this information, we
can express the average value of an action over the
previous $\tau = 1/\epsilon^2$ rounds (the length of a stage).^2 Let
$H(a,i,t) = \{t - \tau \le t' < t \mid a_{t'}(i) = a\}$ be the set of recent
rounds in which $a$ was played by $i$. Then the average value
is $V(a,i,t) = \sum_{t' \in H(a,i,t)} p_{t'}(i) / |H(a,i,t)|$ if $|H(a,i,t)| > 0$
and 0 otherwise. While we need the value of $H$ only at
times that are multiples of $\tau$, for convenience we define it
for arbitrary times $t$.

We say that an agent is an $\epsilon$-stage learner if he chooses
his actions as follows. If $t = 0$, $s_{i,t}$ is chosen at random from
$\{a_\epsilon \mid a \in A\}$. If $t$ is a nonzero multiple of $\tau$, $s_{i,t} = a(i,t)_\epsilon$
where $a(i,t) = \arg\max_{a \in A} V(a,i,t)$. Otherwise, $s_{i,t} = s_{i,t-1}$.
Thus, within a stage, his mixed action is fixed and at the end
of a stage he updates it to use the action with the highest
average value during the previous stage.
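
The stage learner just defined can be sketched in a few functions. This is a minimal single-agent illustration: actions are assumed to be the integers 0 to |A| - 1, `payoff_fn` is a hypothetical stand-in for the payoff the environment returns in a round, and ties in the final maximization are broken by lowest index, a detail the text does not specify.

```python
import random


def mixed_action(a, eps, n_actions):
    """Play a with probability 1 - eps, each other action with eps / (|A| - 1)."""
    return [1 - eps if b == a else eps / (n_actions - 1) for b in range(n_actions)]


def average_values(history, n_actions):
    """Average payoff per action over one stage; 0 for actions never played."""
    totals = [0.0] * n_actions
    counts = [0] * n_actions
    for action, payoff in history:
        totals[action] += payoff
        counts[action] += 1
    return [totals[a] / counts[a] if counts[a] > 0 else 0.0
            for a in range(n_actions)]


def run_stage(current, eps, tau, n_actions, payoff_fn, rng):
    """Play one stage of tau rounds and return the base action for the next stage."""
    probs = mixed_action(current, eps, n_actions)
    history = []
    for _ in range(tau):
        action = rng.choices(range(n_actions), weights=probs)[0]
        history.append((action, payoff_fn(action)))
    values = average_values(history, n_actions)
    return max(range(n_actions), key=lambda a: values[a])


# Hypothetical environment in which action 1 always pays 1 and action 0 pays 0.
next_action = run_stage(0, 0.2, 200, 2, lambda a: float(a == 1), random.Random(0))
```

With 200 rounds and an exploration probability of 0.2, the learner samples the better action many times during the stage and switches to it for the next one.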

The evolution of a game played by stage learners is not
deterministic; each agent chooses a random $s_{i,0}$ and the
sequence of $a_t(i)$ and $p_t(i)$ he observes is also random.
However, with a countably infinite set of agents, we can use the
SLLN to make statements about the overall behavior of the
game. Let $g_t(i) = s_{i,t}$. A run of the game consists of a
sequence of triples $(g_t, a_t, p_t)$. The SLLN guarantees that
with probability 1 the fraction of agents who choose a
strategy $a$ in $a_t$ is $\rho_{g_t}(a)$. Similarly, the fraction of agents who
chose $a$ in $a_t$ that receive payoff $p$ will be $\Pr(a, \rho_{g_t})(p)$ with
probability 1.

^2 The use of the exponent 2 is arbitrary. We require only
that the expected number of times a strategy is explored
increases as $\epsilon$ decreases.

To make our notion of a stage precise, we refer to the
sequence of tuples
$(g_{n\tau}, a_{n\tau}, p_{n\tau}), \ldots, (g_{(n+1)\tau-1}, a_{(n+1)\tau-1}, p_{(n+1)\tau-1})$
as stage $n$ of the run. During stage $n$ there is a stationary
action distribution that we denote $\rho_{g_{n\tau}}$. If $s_{i,(n+1)\tau} = a_\epsilon$
and $a \in ABR_\eta(\rho_{g_{n\tau}})$, then we say that agent $i$ has learned
an $\eta$-best reply during stage $n$ of the run. As the following
lemma shows, for sufficiently small $\epsilon$, most agents will learn
an $\eta$-best reply.

Lemma 3.1. For all large anonymous games $\Gamma$, action
profiles, approximations $\eta > 0$, and probabilities of error
$e > 0$, there exists an $\epsilon^* > 0$ such that for $\epsilon < \epsilon^*$ and all $n$,
if all agents are $\epsilon$-stage learners, then at least a $1 - e$ fraction
of agents will learn an $\eta$-best reply during stage $n$.

Proof. (Sketch) On average, an agent using strategy $a_\epsilon$
plays action $a$ $(1 - \epsilon)\tau$ times during a stage and plays all
other actions $\epsilon\tau/(|A| - 1)$ times each. For $\tau$ large, the realized
number of times played will be close to the expected value
with high probability. Thus, if $\epsilon\tau$ is sufficiently large, then
the average payoff from each action will be exponentially
close to the true expected value (via a standard Hoeffding
bound on sums of i.i.d. random variables), and thus each
learner will correctly identify an action with approximately
the highest expected payoff with probability at least $1 - e$.
By the SLLN, at least a $1 - e$ fraction of agents will learn
an $\eta$-best reply. A detailed version of this proof in a more
general setting can be found in [9]. □
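
To illustrate the role of the Hoeffding bound in the sketch above, the following computes the standard two-sided bound on the deviation of a sample mean and the number of samples of one action that would make the failure probability small. The function names, and the choices of accuracy `delta`, failure probability, and payoff range in the example, are assumptions for illustration; the proof itself does not fix these constants.

```python
import math


def hoeffding_bound(m, delta, payoff_range):
    """Upper bound on P(|sample mean - true mean| >= delta) for m i.i.d. payoffs."""
    return 2 * math.exp(-2 * m * delta ** 2 / payoff_range ** 2)


def samples_needed(delta, fail_prob, payoff_range):
    """Smallest m for which the Hoeffding bound is at most fail_prob."""
    return math.ceil(payoff_range ** 2 * math.log(2 / fail_prob) / (2 * delta ** 2))


# Payoffs in [0, 1], accuracy 0.1, failure probability 0.05 (assumed values).
m = samples_needed(0.1, 0.05, 1.0)
```

Because an exploring agent samples each non-base action only about $\epsilon\tau/(|A| - 1)$ times per stage, it is this count, rather than $\tau$ itself, that has to reach the required number of samples.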

3.4 Convergence Theorem 

Thus far we have defined large anonymous games where 
approximate best-reply dynamics converge. If all agents in 
the game are $\epsilon$-stage learners, then the sequence $\hat\rho_0, \hat\rho_1, \ldots$ of
action distributions in a run of the game is not a best-reply
sequence, but it is close. The action used by most agents
most of the time in each $\hat\rho_n$ is the action used in $\rho_n$ for some
approximate best-reply sequence.

In order to prove this, we need to define "close." Our
definition is based on the error rate $e$ and exploration rate
$\epsilon$ that introduce noise into $\hat\rho_n$. Intuitively, distribution $\hat\rho$
is close to $\rho$ if, by changing the strategies of an $e$ fraction
of agents and having all agents explore an $\epsilon$ fraction of the
time, we can go from an action profile with corresponding
action distribution $\rho$ to one with corresponding distribution
$\hat\rho$. Note that this definition will not be symmetric.

In this definition, $g$ identifies what (pure) action each
agent is using that leads to $\rho$, $g'$ allows an $e$ fraction of
agents to use some other action, and $\hat{g}$ incorporates the fact
that each agent is exploring, so each strategy is an $a_\epsilon$ (the
agent usually plays $a$ but explores with probability $\epsilon$).

Definition 3.2. Action distribution $\hat\rho$ is $(e, \epsilon)$-close to $\rho$ if
there exist $g$, $g'$, and $\hat{g} \in G$ such that:

• $\rho = \rho_g$ and $\hat\rho = \rho_{\hat{g}}$;

• $g(i) \in A$ for all $i \in N$;

• $\|\rho_g - \rho_{g'}\|_1 \le 2e$ (this allows an $e$ fraction of agents
in $g'$ to play a different strategy from $g$);

• for some $\epsilon' \le \epsilon$, if $g'(i) = a$ then $\hat{g}(i) = a_{\epsilon'}$.

The use of $\epsilon'$ in the final requirement ensures that if two
distributions are $(e, \epsilon)$-close then they are also $(e', \epsilon')$-close
for all $e' \ge e$ and $\epsilon' \ge \epsilon$. As an example of the asymmetry
of this definition, $a_\epsilon$ is $(0, \epsilon)$-close to $a$, but the reverse is
not true. While $(e, \epsilon)$-closeness is a useful distance measure
for our analysis, it is an unnatural notion of distance for
specifying the continuity of $u$, where we used the L1 norm.
The following simple lemma shows that this distinction is
unimportant; if $\hat\rho$ is sufficiently $(e, \epsilon)$-close to $\rho$ then it is
close according to the L1 measure as well.

Lemma 3.3. If $\hat\rho$ is $(e, \epsilon)$-close to $\rho$, then $\|\hat\rho - \rho\|_1 \le 2(e + \epsilon)$.

Proof. Since $\hat\rho$ is $(e, \epsilon)$-close to $\rho$, there exist $g$, $g'$, and
$\hat{g}$ as in Definition 3.2. Consider the distributions $\rho_g = \rho$,
$\rho_{g'}$, and $\rho_{\hat{g}} = \hat\rho$. We can view these three distributions as
vectors, and calculate their L1 distances. By Definition 3.2,
$\|\rho_g - \rho_{g'}\|_1 \le 2e$. $\|\rho_{g'} - \rho_{\hat{g}}\|_1 \le 2\epsilon$ because an $\epsilon$ fraction
of agents explore. Thus by the triangle inequality, the L1
distance between $\rho$ and $\hat\rho$ is at most $2(e + \epsilon)$. □
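
A small numerical check of the bound in Lemma 3.3 may help. The sketch uses a finite population of 100 agents as a stand-in for the countably infinite one, and builds three profiles: all agents on one pure action, the same profile with a 0.1 fraction switched to another action, and that profile with every agent exploring with probability 0.05. All of these numbers and helper names are assumptions chosen for illustration.

```python
def pure(a, n_actions):
    """Degenerate mixed action that plays a with probability 1."""
    return [1.0 if b == a else 0.0 for b in range(n_actions)]


def exploring(a, eps, n_actions):
    """Exploring strategy: a with prob. 1 - eps, others share eps equally."""
    return [1 - eps if b == a else eps / (n_actions - 1) for b in range(n_actions)]


def distribution(profile):
    """Fraction of agents on each action: the average of the mixed actions."""
    n = len(profile)
    return [sum(s[a] for s in profile) / n for a in range(len(profile[0]))]


def l1(p, q):
    """L1 distance between two distributions."""
    return sum(abs(x - y) for x, y in zip(p, q))


n_actions, e, eps = 3, 0.1, 0.05
g = [0] * 100                   # every agent plays action 0
g_prime = [1] * 10 + [0] * 90   # an e = 0.1 fraction switches to action 1
rho = distribution([pure(a, n_actions) for a in g])
rho_hat = distribution([exploring(a, eps, n_actions) for a in g_prime])
distance = l1(rho_hat, rho)
```

Here the distance comes out to about 0.285, just under the bound of 2(0.1 + 0.05) = 0.3.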

We have assumed that approximate best-reply sequences
of $\rho_n$ converge, but during a run of the game agents will
actually be learning approximate best replies to $\hat\rho_n$. The
following lemma shows that this distinction does not matter
if $\rho$ and $\hat\rho$ are sufficiently close.

Lemma 3.4. For all $\eta$ there exists a $d_\eta$ such that if $\hat\rho$
is $(e, \epsilon)$-close to $\rho$, $e > 0$, $\epsilon > 0$, and $e + \epsilon < d_\eta$ then
$ABR_{\eta/2}(\hat\rho) \subseteq ABR_\eta(\rho)$.

Proof. Let $K$ be the maximum of the Lipschitz constants
for all $u(a, \cdot)$ and $d_\eta = \eta/(8K)$. Then for all $\hat\rho$ that
are $(e, \epsilon)$-close to $\rho$ and all $a$,
$|u(a,\hat\rho) - u(a,\rho)| \le \|\hat\rho - \rho\|_1 K < 2\eta/(8K) \cdot K = \eta/4$
by Lemma 3.3.

Let $a \notin ABR_\eta(\rho)$ and $a' \in \arg\max_{a' \in A} u(a', \rho)$.
Then $u(a,\rho) + \eta < u(a',\rho)$. Combining this with the above
gives $u(a,\hat\rho) + \eta/2 < u(a',\hat\rho)$. Thus $a \notin ABR_{\eta/2}(\hat\rho)$. □
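
The "combining" step at the end of the proof of Lemma 3.4 can be spelled out as a chain of inequalities. Here $\hat\rho$ denotes the distribution observed in the run and $\rho$ the corresponding distribution in the best-reply sequence; this is a reconstruction of the intended argument from the two facts stated in the proof, not a quotation.

```latex
\begin{align*}
u(a,\hat\rho) + \eta/2
  &< u(a,\rho) + \eta/4 + \eta/2
     && \text{since } |u(a,\hat\rho) - u(a,\rho)| < \eta/4 \\
  &< u(a',\rho) - \eta/4
     && \text{since } u(a,\rho) + \eta < u(a',\rho) \\
  &< u(a',\hat\rho)
     && \text{since } |u(a',\hat\rho) - u(a',\rho)| < \eta/4 .
\end{align*}
```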

Lemmas 3.1 and 3.4 give requirements on $(e, \epsilon)$. In the
statement of the theorem, we call $(e, \epsilon)$ $\eta$-acceptable if they
satisfy the requirements of both lemmas for $\eta/2$ and all
$\eta$-best-reply sequences converge in $\Gamma$.

Theorem 3.5. Let $\Gamma$ be a large anonymous game where
approximate best-reply dynamics converge and let $(e, \epsilon)$ be
$\eta$-acceptable for $\Gamma$. If all agents are $\epsilon$-stage learners then,
for all runs, there exists an $\eta$-best-reply sequence $\rho_0, \rho_1, \ldots$
such that in stage $n$ at least a $1 - e$ fraction will learn a best
reply to $\rho_n$ with probability 1.

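Proof. $\hat\rho_0 = \rho_0$, so $\hat\rho_0$ is $(e, \epsilon)$-close to $\rho_0$. Assume $\hat\rho_n$ is
$(e, \epsilon)$-close to $\rho_n$. By Lemma 3.1, at least a $1 - e$ fraction will
learn an $\eta/2$-best reply to $\hat\rho_n$. By Lemma 3.4, this is an $\eta$-best
reply to $\rho_n$. Thus $\hat\rho_{n+1}$ will be $(e, \epsilon)$-close to $\rho_{n+1}$. □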

Theorem 3.5 guarantees that after a finite number of stages,
agents will be close to an approximate Nash equilibrium
profile. Specifically, $\hat\rho_n$ will be $(e, \epsilon)$-close to an $\eta$-Nash
equilibrium profile $\rho_n$. Note that this means that $\hat\rho_n$ is actually
an $\eta'$-Nash equilibrium for a larger $\eta'$ that depends on $\eta$, $e$, $\epsilon$,
and the Lipschitz constant $K$.

Our three requirements for a practical learning algorithm 
were that it require minimal information, converge quickly 
in a large system, and be robust to noise. Stage learning re- 
quires only that an agent know his own payoffs, so the first 
condition is satisfied. Theorem 3.5 shows that it satisfies
the other two requirements. Convergence is guaranteed in
a finite number of stages. While the number of stages
depends on the game, in Section 3.2 we argued that in many
cases it will be quite small. Finally, robustness comes from
tolerating an $e$ fraction of errors. While in our proofs we
assumed these errors were due to learning, the analysis is
the same if some of this noise is from other sources such
as churn (agents entering and leaving the system) or agents
making errors. We discuss this issue more in Section 5.

4. SIMULATION RESULTS 

Theorem 3.5 guarantees convergence for a sufficiently small
exploration probability $\epsilon$, but decreasing $\epsilon$ also increases $\tau$,
the length of a stage. Increasing the length of a stage means
that agents take longer to reach equilibrium, so for stage
learning to be practical, $\epsilon$ needs to be relatively large. To
show that $\epsilon$ can be large in practice, we tested populations
of stage learners in a number of games where best-reply
dynamics converge and observed convergence with $\epsilon$
between 0.01 and 0.05. This allows convergence within a few
thousand rounds in many games. While our theorem applies 
only to stage learning, the analysis provides intuition as to 
why a reasonable algorithm that changes slowly enough that 
other learners have a chance to learn best replies should con- 
verge as well. To test a very different type of algorithm, we 
also implemented the no-regret learning algorithm of Hart 
and Mas-Colell [14]. This algorithm also quickly converged
close to Nash equilibrium, although in many games it did 
not converge as closely as stage learning. 

Our theoretical results make two significant predictions 
about factors that influence the rate of convergence. Lemma 3.1
tells us that the length of a stage is determined by the num- 
ber of times each strategy needs to be explored to get an 
accurate estimate of its value. Thus the amount of infor- 
mation provided by each observation has a large effect on 
the rate of convergence. For example, in a random matching
game, an agent's payoff provides information about the
strategy of one other agent. On the other hand, if he receives 
his expected payoff for being matched, a single observation 
provides information about the entire distribution of strate- 
gies. In the latter case the agent can learn with many fewer 
observations. 

A related prediction is that having more agents will lead to 
faster convergence, particularly in games where payoffs are 
determined by the average behavior of other agents, because 
variance in payoffs due to exploration and mistakes decreases 
as the number of agents increases. Our experimental results 
illustrate both of these phenomena. 
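
The variance argument can be illustrated with a short calculation. The sketch assumes that every other agent independently uses the same exploring strategy around a common base action and that actions are numeric, so that the average action of $n$ agents has variance equal to the single-agent variance divided by $n$. These modelling choices and the function names are illustrative assumptions, not the experimental setup.

```python
def single_agent_variance(base_action, eps, actions):
    """Variance of one agent's action when it explores with probability eps."""
    others = [a for a in actions if a != base_action]
    probs = {a: eps / len(others) for a in others}
    probs[base_action] = 1 - eps
    mean = sum(a * p for a, p in probs.items())
    return sum(p * (a - mean) ** 2 for a, p in probs.items())


def variance_of_average(base_action, eps, actions, n_agents):
    """Variance of the average action of n independent, identical agents."""
    return single_agent_variance(base_action, eps, actions) / n_agents


# Twenty numeric actions, base action 10, exploration probability 0.05 (assumed).
v_small = variance_of_average(10, 0.05, range(20), 10)
v_large = variance_of_average(10, 0.05, range(20), 1000)
```

Under these assumptions, going from 10 to 1000 agents reduces the variance of the average by a factor of 100, so a payoff that depends on the average becomes a much cleaner signal.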

We tested the learning behavior of stage learners and no- 
regret learners in a number of games, including prisoner's 
dilemma, a climbing game [6], the congestion game described 
in [12] with both ACP and serial mechanisms, and two differ- 
ent contribution games (called a Diamond-type search model 
in [20]). We implemented payoffs both by randomly matching 
players and by giving each player what his expected payoff 
would have been had he been randomly matched (some 
payoffs were adjusted to make the games symmetric). Results 
were similar across the different games, so we report 
only the results for a contribution game. 

Figure 1: Convergence with the average 

In the contribution game, agents choose strategies from 0 
to 19, indicating how much effort they contribute to a collective 
enterprise. The value to an agent depends on how much 
he contributes, as well as how much other agents contribute. 
If he contributes x and the contribution of the other agents 
is y, then his utility is $2xy - c(x)$, where $c(0) = 0$, $c(1) = 1$, 
$c(x) = (x - 1)^2$ for $x \in \{2, \ldots, 8\}$, and $c(x) = x^2 + 2n$ for $x > 8$. 
We considered two versions of this game. In the first, y is 
determined by the average strategy of the other agents. In 
the second, y is determined by randomly matching the agent 
with another agent. 
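
The payoff structure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the cost is taken to be quadratic, (x − 1)^2, on 2..8, and the cost above 8 is replaced by a stand-in penalty x^2 + 2x, chosen only so that contributions above 8 are costly. The function names are ours.

```python
def cost(x):
    """Effort cost c(x). The branch for x > 8 is an assumed stand-in."""
    if x <= 1:
        return x                # c(0) = 0, c(1) = 1
    if x <= 8:
        return (x - 1) ** 2     # assumed quadratic cost on 2..8
    return x ** 2 + 2 * x       # assumption: a steep penalty above 8


def utility(x, y):
    """Utility of contributing x when the others' contribution is y."""
    return 2 * x * y - cost(x)


def best_reply(y, strategies=range(20)):
    """Strategy in 0..19 with the highest utility against y."""
    return max(strategies, key=lambda x: utility(x, y))
```

With these assumptions, iterating `best_reply` from y = 1 raises the contribution by one per step until it reaches 8 and then stays there, which matches the equilibrium at 8 used in the distance measure.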

Our implementation of stage learners is as described in 
Section 3.3, with ε = 0.05 when y is determined by the average 
and ε = 0.01 when y is determined by random matching. 
Rather than taking the length of a stage τ to be $1/\epsilon^2$, we 
set τ = 250 and 2000, respectively; this gives better 
performance. Our implementation of no-regret learners is based 
on that of Hart and Mas-Colell [14], with improvements sug- 
gested by Greenwald et al. [12] . 
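
The stage-learning loop for the average-payoff version can be sketched as follows. This is an illustrative reconstruction, not the code used in the experiments. The class and parameter names are hypothetical, initial strategies are drawn at random, and strategies that were never played during a stage are simply ignored at the end of that stage; all three choices are assumptions.

```python
import random
from collections import defaultdict


class StageLearner:
    """Plays one strategy for a whole stage, exploring with probability eps."""

    def __init__(self, n_actions, eps, rng, start=0):
        self.n_actions, self.eps, self.rng = n_actions, eps, rng
        self.current = start
        self._reset()

    def _reset(self):
        self.total = defaultdict(float)   # summed payoff per action
        self.count = defaultdict(int)     # number of plays per action

    def choose(self):
        if self.rng.random() < self.eps:
            return self.rng.randrange(self.n_actions)   # explore
        return self.current

    def observe(self, action, payoff):
        self.total[action] += payoff
        self.count[action] += 1

    def end_stage(self):
        # Switch to the action with the best average payoff this stage.
        if self.count:
            self.current = max(
                self.count, key=lambda a: self.total[a] / self.count[a])
        self._reset()


def run(n_agents, eps, tau, n_stages, utility, n_actions=20, seed=0):
    """Simulate n_agents >= 2 stage learners; y is the others' average."""
    rng = random.Random(seed)
    agents = [StageLearner(n_actions, eps, rng, rng.randrange(n_actions))
              for _ in range(n_agents)]
    for _ in range(n_stages):
        for _ in range(tau):
            actions = [agent.choose() for agent in agents]
            total = sum(actions)
            for agent, x in zip(agents, actions):
                y = (total - x) / (n_agents - 1)
                agent.observe(x, utility(x, y))
        for agent in agents:
            agent.end_stage()
    return [agent.current for agent in agents]
```

Calling `run` with the contribution game's utility function, `eps=0.05`, and `tau=250` corresponds to the first setting described above; `utility` is passed in so that the same loop can be reused for other games.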

Figure 1 shows the results for learners in the version of the 
game where y is the average strategy of other agents. Each 
curve shows the distance from equilibrium as a function of 
the number of rounds of a population of agents of a given size 
using a given learning algorithm. The results were averaged 
over 10 runs. Since the payoffs for nearby strategies are 
close, we want our notion of distance to take into account 
that agents playing 7 are closer to equilibrium (8) than those 
playing zero. Therefore, we consider the expected distance 
of p from equilibrium: $\sum_a p(a)\,|a - 8|$. To determine p, we 
counted the number of times each action was played over the length 
of a stage, so in practice the distance will never be zero 
due to mistakes and exploration. For ease of presentation, 
the graph shows only populations of size up to 100; similar 
results were obtained for populations up to 5000 agents. 
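
As a small illustration of this distance measure, the sketch below estimates p from a list of observed actions and returns the expected distance from the equilibrium strategy. The function name and the list-of-actions input format are assumptions made for the example.

```python
from collections import Counter


def expected_distance(actions, equilibrium=8):
    """Expected distance of the empirical distribution p from equilibrium:
    the sum over actions a of p(a) * |a - equilibrium|."""
    counts = Counter(actions)
    n = len(actions)
    return sum(c / n * abs(a - equilibrium) for a, c in counts.items())
```

For example, if two agents play 8, one plays 7, and one plays 0, the distance is (0 + 0 + 1 + 8) / 4 = 2.25, so an agent playing 7 contributes far less to the distance than one playing zero.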

For stage learning, increasing the population size has a 
dramatic impact. With two agents, mistakes and best replies 
to the results of these mistakes cause behavior to be quite 
chaotic. With ten agents, agents successfully learn, although 
mistakes and suboptimal strategies are quite frequent. With 
one hundred agents, all the agents converge quickly to equilibrium 
strategies and mistakes are rare; almost all of the 
distance from equilibrium is due to exploration. 

Figure 2: Convergence with random matching 

No-regret learning also converges quickly, but the "qual- 
ity" of convergence (how close we get to equilibrium) is not 
as high. The major problem is that a significant fraction of 
agents play near-optimal actions rather than the optimal action. 
This may have a number of causes. First, the guarantee is 
that the asymptotic value of p will be an equilibrium, which 
allows the short periods that we consider to be far from equilibrium. 
Second, the quality of convergence depends on ε, 
so tight convergence may require a much lower rate of ex- 
ploration and thus a much longer convergence time. Finally, 
this algorithm is guaranteed to converge only to a correlated 
equilibrium, which may not be a Nash equilibrium. 

Figure 2 shows the results when agent payoffs are determined 
by randomly matching agents. Even for large numbers 
of stage learners, convergence is not as tight and takes 
on the order of ten times longer. This is a result of the infor- 
mation available to agents. When payoffs were determined 
by the average strategy, a single observation was sufficient 
to evaluate a strategy, so we could use very short stages. To 
deal with the noise introduced by random matching we need 
much longer stages. The number of stages to convergence 
is similar. Even with longer stages and a large number of 
agents, mistakes are quite common. Nevertheless agents do 
successfully learn. The performance of no-regret learners is 
less affected because they use payoff information from the 
entire run of the game, while stage learners discard payoff 
information at the end of each stage. 

Convergence in the random-matching game takes approx- 
imately 20,000 rounds, which is too slow for many applica- 
tions. If a system design requires this type of matching, this 
makes learning problematic. However, the results of Figure 1 
suggest that the learning could be done much faster 
if the system designer could supply agents with more information. 
This suggests that collecting statistical information 
about the behavior of agents may be a critical feature for 
ensuring fast convergence. If agents know enough about the 
game to determine their expected payoffs from this statistical 
information, then they can directly learn, as in Figure 1. 
Even with less knowledge about the game, statistical infor- 
mation can still speed learning, for example, by helping an 
agent determine whether the results of exploring an action 
were typical or due to the other agent using a rare action. 

5. DISCUSSION 

While our results show that a natural learning algorithm 
can learn efficiently in an interesting class of games, there 
are many further issues that merit exploration. 

Other Learning Algorithms 

Our theorem assumes that agents use a simple rule for learning 
within each stage: they average the value of payoffs 
received. However, there are certainly other rules for es- 
timating the value of an action; any of these can be used 
as long as the rule guarantees that errors can be made ar- 
bitrarily rare given sufficient time. It is also not necessary 
to restrict agents to stage learning. Stage learning guar- 
antees a stationary environment for a period of time, but 
such strict behavior may not be needed or practical. Other 
approaches, such as exponentially discounting the weight of 
observations [12, 19] or Win or Learn Fast [3], allow an algorithm 
to focus its learning on recent observations and provide 
a stable environment in which other agents can learn. 
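
One common form of exponential discounting is an exponentially weighted moving average, sketched below. This is a generic illustration and not necessarily the exact rule used in the cited work; the step size `alpha` and the initial estimate are hypothetical parameters.

```python
def discounted_estimate(payoffs, alpha=0.1, initial=0.0):
    """Exponentially weighted estimate of an action's value.

    Each new payoff gets weight alpha, and the weight of every older
    observation shrinks by a factor (1 - alpha) per step.
    """
    estimate = initial
    for payoff in payoffs:
        estimate = (1 - alpha) * estimate + alpha * payoff
    return estimate
```

A larger `alpha` tracks recent payoffs more closely but gives noisier estimates, which is the same trade-off that the stage length controls in stage learning.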

Other Update Rules 

In addition to using different algorithms to estimate the val- 
ues of actions, a learner could also change the way he uses 
those values to update his behavior. For example, rather 
than basing his new strategy on only the last stage, he could 
base it on the entire history of stages and use a rule in the 
spirit of fictitious play. Since there are games where fictitious 
play converges but best-reply dynamics do not, this could 
extend our results to another interesting class of games, as 
long as the errors in each period do not accumulate over 
time. Another possibility is to update probabilistically or 
use a tolerance to determine whether to update (see e.g. [T], 
[14]). This could allow convergence in games where best- 
reply dynamics oscillate or decrease the fraction of agents 
who make mistakes once the system reaches equilibrium. 

Model Assumptions 

Our model makes several unrealistic assumptions, most no- 
tably that there are countably many agents who all share 
the same utility function. Essentially the same results hold 
with a large, finite number of agents, adding a few more 
"error terms". In particular, since there is always a small 
probability that every agent makes a mistake at the same 
time, we can prove only that no more than a 1 − ε fraction 
of the agents make errors in most rounds, and that agents 
spend most of their time playing equilibrium strategies. 

We have also implicitly assumed that the set of agents is 
fixed. We could easily allow for churn: agents entering and 
leaving the system. A reasonable policy for newly arriving 
agents is to pick a random strategy to use in the next stage. If 
all agents do this, it follows that convergence is unaffected: 
we can treat the new agents as part of the ε fraction that 
made a mistake in the last stage. Furthermore, this tells us 
that newly arriving agents "catch up" very quickly. After a 
single stage, new agents are guaranteed to have learned a 
best reply with probability at least 1 − ε. 

Finally, we have assumed that all agents have the same 
utility function. Our results can easily be extended to in- 
clude a finite number of different types of agents, each with 
their own utility function, since the SLLN can be applied to 
each type of agent. We believe that our results hold even 
if the set of possible types is infinite. This can happen, for 
example, if an agent's utility depends on a valuation drawn 
from some interval. However, some care is needed to define 
best-reply sequences in this case. 

State 

One common feature of distributed systems not addressed 
in this work is state. For example, in a scrip system where 
agents pay each other for service using an internal currency 
or scrip, whether an agent should seek to provide service 
depends on the amount of money he currently has [8]. 

In principle, we could extend our framework to games 
with state: in each stage each agent chooses a policy to 
usually follow and explores other actions with probability ε. 
Each agent could then use some off-policy algorithm (one 
where the agent can learn without controlling the sequence 
of observations; see [1^ for examples) to learn an optimal 
policy to use in the next stage. One major problem with this 
approach is that standard algorithms learn too slowly for our 
purposes. For example, Q-learning [25] typically needs to 
observe each state-action pair hundreds of times in practice. 
The low exploration probability means that the expected 
$|S||A|/\epsilon$ rounds needed to explore each pair even once 
is large. Efficient learning requires more specialized 
algorithms that can make better use of the structure of a 
problem, but this also makes providing a general guarantee 
of convergence more difficult. Another problem is that, even 
if an agent explores each action for each of his possible local 
states, the payoff he receives will depend on the states of 
the other agents and thus the actions they chose. We need 
some property of the game to guarantee that this distribution 
of states is in some sense "well behaved." 
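
The exploration cost mentioned above can be made concrete with a short calculation. The sketch assumes the expected number of rounds is the number of states times the number of actions divided by ε, and the example sizes (10 states, 5 actions) are hypothetical.

```python
def exploration_rounds(n_states, n_actions, eps, visits=1):
    """Expected rounds to explore every state-action pair `visits` times,
    assuming n_states * n_actions / eps rounds per full sweep."""
    return visits * n_states * n_actions / eps
```

With 10 states, 5 actions, and ε = 0.01, a single sweep already takes about 5,000 rounds, and observing each pair 100 times takes about 500,000, which illustrates why standard algorithms are too slow in this setting.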

Despite these concerns, preliminary results suggest that 
simple learning algorithms work well for games with state. 
In experiments on a game using the model of a scrip system 
from 0, we found that a stage-learning algorithm that uses 
a specialized algorithm for determining the value of actions 
in each stage converges to equilibrium quickly despite churn 
and agents learning at different rates. 

Mixed Equilibria 

Another restriction of our results is that our agents only 
learn pure strategies. One way to address this is to discretize 
the mixed strategy space (see e.g. [?]). If one of the resulting 
strategies is sufficiently close to an equilibrium strategy and 
best-reply dynamics converge with the discretized strategies, 
then we expect agents to converge to a near-equilibrium dis- 
tribution of strategies. We have had empirical success using 
this approach to learn to play rock-paper-scissors. 
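
One simple way to discretize the mixed strategy space is a uniform grid on the probability simplex, sketched below. The grid resolution `k` and the function name are hypothetical; the text above does not specify which discretization was used.

```python
from fractions import Fraction
from itertools import product


def discretized_simplex(n_actions, k):
    """All mixed strategies whose probabilities are multiples of 1/k."""
    points = []
    for counts in product(range(k + 1), repeat=n_actions - 1):
        rest = k - sum(counts)          # mass left for the last action
        if rest >= 0:
            points.append(tuple(Fraction(c, k) for c in (*counts, rest)))
    return points
```

For rock-paper-scissors, `discretized_simplex(3, 3)` yields 10 mixed strategies, including the uniform strategy (1/3, 1/3, 1/3). With `k = 4` the uniform strategy is not on the grid, so the resolution determines how close the nearest grid point is to an equilibrium strategy.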

Unexpected and Byzantine Behavior 

In practice, we expect that not all agents will be trying to 
learn optimal behavior in a large system. Some agents may 
simply play some particular (possibly mixed) strategy that 
they are comfortable with, without trying to learn a better 
strategy. Others may be learning but with an unanticipated 
utility function. Whatever their reasons, if sufficiently 
few such agents are choosing their strategies i.i.d. from fixed 
distributions (or at least fixed for each stage), then our results 
hold without change. This is because we already allow an 
ε fraction of agents to make arbitrary mistakes, so we can 
treat these agents as simply mistaken. 

Byzantine agents, who might wish to disrupt learning as 
much as possible, do not fit as neatly into our framework; 
they need not play the same strategy for an entire stage. 
However, we expect that since correct agents are randomizing 
their decisions, a small number of Byzantine agents 
should not be able to cause many agents to make mistakes. 

6. CONCLUSION 

Learning in distributed systems requires algorithms that 
are scalable to thousands of agents and can be implemented 
with minimal information about the actions of other agents. 
Most general-purpose multiagent learning algorithms fail one 
or both of these requirements. We have shown here that 
stage learning can be an efficient solution in large anonymous 
games where approximate best-reply dynamics lead to ap- 
proximate pure strategy Nash equilibria. Many interesting 
classes of games have this property, and it is frequently found 
in designed games. In contrast to previous work, the time 
to convergence guaranteed by the theorem does not increase 
with the number of agents. If system designers can find 
an appropriate game satisfying these properties on which to 
base their systems, they can be confident that nodes can 
efficiently learn appropriate behavior. 

Our results also highlight two factors that aid conver- 
gence. First, having more learners often improves perfor- 
mance. With more learners, the noise introduced into pay- 
offs by exploration and mistakes becomes more consistent. 
Second, having more information typically improves perfor- 
mance. Publicly available statistics about the observed be- 
havior of agents can allow an agent to learn effectively while 
making fewer local observations. 